
feat: unify config-driven retry across VLM and embedding#1049

Merged
MaojiaSheng merged 13 commits into volcengine:main from snemesh:feature/unified-retry
Mar 31, 2026

Conversation

@snemesh
Contributor

@snemesh snemesh commented Mar 28, 2026

Summary

Closes #922

Unifies retry behavior across all VLM and embedding paths with a shared transient retry module, replacing scattered per-backend implementations.

Changes

  • New module — openviking/models/retry.py with is_transient_error(), transient_retry(), transient_retry_async(); error classification (28 error types), exponential backoff with jitter, count-based retry
  • VLM — Remove function-level max_retries param from get_completion_async(). Retry is now config-driven via vlm.max_retries (default 3, was 2). Add retry to get_vision_completion_async() (was zero retry) and sync methods
  • Embedding — Add embedding.max_retries config (default 3). Apply unified retry to all 8 providers (OpenAI, Volcengine, VikingDB, Gemini, MiniMax, Jina, Voyage, LiteLLM)
  • SDK retry disabled everywhere to prevent double-retry explosion (SDK retry × our retry)
  • Kwargs migration — Switch VLM call chains from positional to keyword arguments (fixes pre-existing tool_choice/messages positional bug in vlm_config.py)
  • Backward compatible — exponential_backoff_retry() unchanged; max_retries=0 disables retry
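The shape of the shared module can be sketched as follows. This is a minimal illustration, not the actual openviking/models/retry.py: the function names and the max_retries=0 semantics come from the PR description, while the backoff constants (1 s base, 30 s cap) and the status-code check are assumptions.

```python
import asyncio
import random

# Assumed set of retryable HTTP status codes (matches the PR's list).
TRANSIENT_STATUS = {429, 500, 502, 503, 504}

def is_transient_error(exc: Exception) -> bool:
    """Classify an error as retryable; unknown errors default to permanent."""
    status = getattr(exc, "status_code", None)
    if status is not None:
        return status in TRANSIENT_STATUS
    return isinstance(exc, (ConnectionError, TimeoutError))

async def transient_retry_async(call, max_retries: int = 3, base: float = 1.0):
    """Retry `call` on transient errors with exponential backoff plus jitter.

    max_retries=0 disables retry (the call runs exactly once).
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except Exception as exc:
            if attempt == max_retries or not is_transient_error(exc):
                raise
            # Exponential backoff, capped, with 50-100% jitter (assumed shape).
            delay = min(base * 2 ** attempt, 30.0) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)
```

A sync transient_retry() would follow the same loop with time.sleep() instead of await.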

Config

vlm:
  max_retries: 3    # default, was 2

embedding:
  max_retries: 3    # new field
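The config contract above can be read as: each section carries its own max_retries, falling back to the new default of 3. A hypothetical resolver (resolve_max_retries and DEFAULTS are illustrative names, not project code) showing that contract:

```python
# Defaults per the PR: vlm.max_retries 3 (was 2), embedding.max_retries 3 (new).
DEFAULTS = {"vlm": {"max_retries": 3}, "embedding": {"max_retries": 3}}

def resolve_max_retries(config: dict, section: str) -> int:
    """Read max_retries for a config section; max_retries=0 disables retry."""
    return config.get(section, {}).get(
        "max_retries", DEFAULTS[section]["max_retries"])
```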

Breaking Changes

  • max_retries parameter removed from VLMBase.get_completion_async(), StructuredVLM.complete_json_async(), StructuredVLM.complete_model_async(), VLMConfig.get_completion_async()
  • Retry is now fully config-driven

Error Classification

Retryable: HTTP 429/500/502/503/504, TooManyRequests, RateLimit, RequestBurstTooFast, ConnectionError, TimeoutError, openai.RateLimitError

Not retryable: HTTP 400/401/403/404/422, InvalidRequestError, AuthenticationError, unknown errors (conservative default)
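For SDK errors that do not expose a status code, classification by exception name or message is one plausible mechanism. The pattern list below mirrors the retryable names from the PR; the regex approach and the looks_transient helper are assumptions, not the module's actual implementation.

```python
import re

# Name/message patterns treated as transient (list taken from the PR text).
TRANSIENT_PATTERN = re.compile(
    r"TooManyRequests|RateLimit|RequestBurstTooFast|\b(429|500|502|503|504)\b")

# Stand-ins for provider SDK exception classes.
class RateLimitError(Exception): pass
class AuthenticationError(Exception): pass

def looks_transient(exc: Exception) -> bool:
    # Unknown errors fall through to False — the conservative default.
    return bool(TRANSIENT_PATTERN.search(f"{type(exc).__name__}: {exc}"))
```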

Test Plan

  • 84 new tests across 5 files
  • Error classification: 28 parametrized cases (transient vs permanent)
  • Retry behavior: sync + async, backoff verification, edge cases
  • VLM backend integration: retry on 429, no retry on 401, vision retry, SDK disabled
  • Embedding provider integration: OpenAI + VikingDB retry verification
  • Config flow: end-to-end from ov.conf to backend instance
  • Backward compatibility: exponential_backoff_retry unchanged


@qin-ctx qin-ctx self-assigned this Mar 28, 2026
@snemesh snemesh force-pushed the feature/unified-retry branch from e73415d to b59fd97 on March 28, 2026 10:32
@MaojiaSheng
Collaborator

there are some conflicts that need to be resolved, thanks

snemesh added 11 commits March 30, 2026 17:46
Implements `openviking/models/retry.py` with `is_transient_error`,
`transient_retry`, and `transient_retry_async` — a single config-driven
retry layer replacing scattered per-backend implementations.
Adds 50 unit tests covering classification, backoff, jitter, exhaustion,
and custom predicates.
- VLMBase: change max_retries default 2→3, remove max_retries param from
  get_completion_async abstract signature
- OpenAI backend: wrap all 4 methods with transient_retry/transient_retry_async,
  disable SDK retry (max_retries=0 in client constructors), remove manual
  for-loop retry
- VolcEngine backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
- LiteLLM backend: same pattern — transient_retry for all methods,
  remove manual for-loop retry
  (volcengine#922)

- VLMConfig: default max_retries 2→3, remove max_retries from
  get_completion_async signature, switch all wrappers to kwargs
- StructuredVLM (llm.py): remove max_retries from complete_json_async and
  complete_model_async, switch all internal calls to kwargs
- memory_react.py: remove max_retries=self.vlm.max_retries (now handled
  internally by backend)
- Update test stubs to match new signatures (remove max_retries=0)
Tests cover OpenAI backend as representative:
- Completion retries on 429, does NOT retry on 401
- Vision completion now retries (was zero before)
- Config max_retries is used (default=3)
- max_retries removed from get_completion_async signature (all backends)
- OpenAI SDK retry disabled (max_retries=0 in client constructors)
- EmbeddingConfig: new max_retries field (default=3) for retry configuration
- EmbeddingConfig._create_embedder(): injects max_retries into params["config"]
- EmbedderBase.__init__(): extracts max_retries from the config dict
- OpenAI: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Volcengine: replace exponential_backoff_retry with transient_retry, drop is_429_error
- VikingDB: add transient_retry (previously had no retry at all)
- Gemini: disable SDK HttpRetryOptions (attempts=1), wrap embed/embed_batch
- MiniMax: disable urllib3 Retry (total=0), wrap embed/embed_batch
- Jina: disable SDK retry (max_retries=0), wrap embed/embed_batch
- Voyage: disable SDK retry (max_retries=0), wrap embed/embed_batch
- LiteLLM: wrap litellm.embedding() calls

All providers now use the unified transient_retry with is_transient_error for error classification. The wrapper sits INSIDE each method, around the raw API call, BEFORE the try/except that converts errors to RuntimeError.
- test_embedding_retry_integration: OpenAI and VikingDB retry on transient/permanent errors
- test_retry_config: VLMConfig and EmbeddingConfig max_retries fields and defaults
- test_backward_compat: exponential_backoff_retry importable, signature unchanged, time-based
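The wrapper placement described in these commits — inside the method, around the raw API call, before the RuntimeError-converting try/except — can be sketched like this. FakeClient, Embedder, and the TimeoutError-only predicate are illustrative stand-ins, not the real embedder code.

```python
def transient_retry(call, max_retries=3):
    # Minimal count-based retry; TimeoutError stands in for is_transient_error.
    for attempt in range(max_retries + 1):
        try:
            return call()
        except TimeoutError:
            if attempt == max_retries:
                raise

class FakeClient:
    """Fails `failures` times with a transient error, then succeeds."""
    def __init__(self, failures):
        self.failures = failures
        self.calls = 0
    def create(self, text):
        self.calls += 1
        if self.calls <= self.failures:
            raise TimeoutError("transient")
        return [0.1, 0.2]

class Embedder:
    def __init__(self, client, max_retries=3):
        self.client = client
        self.max_retries = max_retries
    def embed(self, text):
        try:
            # Retry wraps only the raw API call...
            return transient_retry(lambda: self.client.create(text),
                                   max_retries=self.max_retries)
        except Exception as exc:
            # ...so exhausted/permanent errors are still normalized here.
            raise RuntimeError(f"embedding failed: {exc}") from exc
```

Placing the retry inside the error-normalizing try/except would not work: once an exception becomes RuntimeError, the transient/permanent distinction is lost.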
After upstream refactored sync methods to use run_async(),
only transient_retry_async is needed in volcengine_vlm.py.
@snemesh snemesh force-pushed the feature/unified-retry branch from 6f9439b to 03ee476 on March 30, 2026 16:02
@snemesh
Contributor Author

snemesh commented Mar 30, 2026

Conflicts resolved

Rebased onto current main. Here's a summary of the conflict resolution, particularly around volcengine_vlm.py which had the most significant changes.

Background

Two upstream commits landed after this PR was branched:

  • ce99887 — License change Apache-2.0 → AGPL-3.0 (touched all file headers)
  • a66a1a6 — Refactor memory extract v2 (#1045) + VolcEngine Responses API with prefix caching

The license conflicts were trivial (accept upstream). The interesting part is how retry integrates with the new VolcEngine architecture.

volcengine_vlm.py — Retry + Responses API

Before (old architecture): VolcEngine used client.chat.completions.create() directly. Our retry wrapped that call:

async def _call():
    return await client.chat.completions.create(**kwargs)
response = await transient_retry_async(_call, max_retries=self.max_retries)

After (new architecture): Upstream refactored to use VolcEngine Responses API with prefix caching via responseapi_prefixcache_completion(). Sync methods now delegate through run_async().

Resolution: Retry wraps the high-level responseapi_prefixcache_completion call:

async def _call():
    return await self.responseapi_prefixcache_completion(
        static_segments=static_segments,
        dynamic_messages=dynamic_messages,
        response_format=response_format,
        tools=tools,
        tool_choice=tool_choice,
    )
response = await transient_retry_async(_call, max_retries=self.max_retries)

Why this is safe:

  • Retry is placed at the same abstraction level — around the API call, before response processing
  • Prefix cache is deterministic (same input → same cache key), so retries won't cause cache inconsistency
  • Sync methods (get_completion, get_vision_completion) delegate to async via run_async(), so they inherit retry automatically
  • get_vision_completion_async delegates to get_completion_async, so vision also gets retry
  • Only transient_retry_async is imported (removed unused transient_retry since sync paths use run_async)
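The sync-inherits-retry point can be sketched as follows. This is a simplified stand-in: run_async here is plain asyncio.run rather than the project's helper, and the retry loop is inlined instead of calling transient_retry_async.

```python
import asyncio

def run_async(coro):
    # Simplified stand-in for the project's run_async helper.
    return asyncio.run(coro)

class VLM:
    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.attempts = 0

    async def _call(self, prompt):
        # Simulated backend: fails once, then succeeds.
        if self.attempts < 2:
            raise TimeoutError("transient")
        return f"answer:{prompt}"

    async def get_completion_async(self, prompt):
        # Config-driven retry lives only in the async path.
        for attempt in range(self.max_retries + 1):
            self.attempts += 1
            try:
                return await self._call(prompt)
            except TimeoutError:
                if attempt == self.max_retries:
                    raise

    def get_completion(self, prompt):
        # Sync path delegates to async, inheriting retry automatically.
        return run_async(self.get_completion_async(prompt))
```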

memory_react.py — Accepted deletion

Upstream deleted this file in #1045. Our branch had modified the get_completion_async call there to pass max_retries explicitly. Since retry is now config-driven and max_retries was removed from the signature, that change is no longer needed. Accepted the deletion.

Other files

  • openai_vlm.py, llm.py, vlm_config.py — straightforward: accepted upstream formatting/license, applied our kwargs migration and retry on top
  • openai_embedders.py — kept both imports (transient_retry + DEFAULT_AZURE_API_VERSION)

Test results

All 84 tests from this PR pass. No regressions introduced (the 71 pre-existing failures on main are unchanged).

snemesh added 2 commits March 30, 2026 18:16
- Rename unused loop vars i, item → _i, _item (B007)
- Suppress unused has_images assignment (F841) — upstream code, kept for clarity
test_fs_tree intermittently returns 500 on windows-latest when
HAS_SECRETS=false — the resource processing pipeline retries
failed embeddings (401 auth errors) in the background, causing
server load that affects the fs/tree endpoint.
@MaojiaSheng MaojiaSheng merged commit a092d64 into volcengine:main Mar 31, 2026
11 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 31, 2026
@qin-ctx
Collaborator

qin-ctx commented Mar 31, 2026

For context, we have also been working on a very similar implementation in parallel in #926.

Functionally, the direction is the same: move retry down to the model layer, make it config-driven, and apply it consistently across VLM and embedding instead of adding more call-site wrappers.

The main differences vs our in-flight branch are:

  • #1049 lands the broad VLM + embedding unification and the config contract directly in one PR;
  • our branch also centralizes transient-error classification in a shared utility and reuses the same classification in the circuit-breaker path, so retry and breaker behavior stay aligned;
  • our current test coverage emphasis is a bit different, with some extra config/wiring and integration-path verification around the surrounding call flow.

Our follow-up plan on our side is to converge on a single bottom-layer retry implementation rather than keep two parallel versions, keep the config-driven contract as the baseline, and only carry forward the pieces that still add value on top of #1049 (for example shared classifier reuse and any targeted behavior/test gaps we find).

